Training on an imbalanced dataset can cause classifiers to overfit the majority
class and increase the risk of information loss for the minority class.
Moreover, accuracy may not give a clear picture of the classifier’s
performance. This paper used decision tree (DT), support vector machine
(SVM), artificial neural network (ANN), K-nearest neighbors (KNN) and
Naive Bayes (NB) models, together with the ensemble models random forest (RF) and
gradient boosting (GB), which use bagging and boosting respectively, along with three
sampling approaches and seven performance metrics to investigate the effect
of class imbalance on water quality data. Based on the results, the best model
was gradient boosting without resampling for almost all metrics except
balanced accuracy, sensitivity and area under the curve (AUC), followed by
the random forest model without resampling in terms of specificity, precision and
AUC. However, in terms of balanced accuracy and sensitivity, the highest
performance was achieved by random forest with a random under-sampled
dataset. Focusing on each performance metric separately, the results showed
that for specificity and precision, it is better not to resample the data for the ensemble
classifiers. Nevertheless, the results for balanced accuracy and sensitivity
showed improvement for both ensemble classifiers when using all the
resampled datasets.

1. INTRODUCTION


Journal homepage: http://ijeecs.iaescore.com

Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752

One of the main challenges in machine learning is processing imbalanced data for classification
tasks [1]. The classification of imbalanced data has recently become a widely explored issue because, when
data are imbalanced, classifiers tend to produce a biased model with close to zero sensitivity
for the minority class. Even if not a single minority class sample is classified correctly, accuracy can reach
up to 99% because most majority class samples are classified correctly. In other words, accuracy does not give a clear
picture of the classifier’s performance on an imbalanced dataset. Imbalanced data occur in many
fields, such as bankruptcy risk data [2], credit scoring [3], healthcare data [4], student performance [5],
point cloud data [6], anomaly detection [7] and water quality data [8]. In real-world applications, the
severity of class imbalance may range from mild to extreme [9]. The imbalance is said to be mild if
the minority class makes up 20%-40% of the data, moderate if it makes up less than 20% and
extreme if it makes up less than 1%. A classifier applied without any strategy for handling imbalanced data will
tend to ignore the minority class and, as a result, will almost inevitably classify it incorrectly.
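
The point about accuracy can be illustrated with a minimal Python sketch. The class counts (990 majority and 10 minority samples) and the always-majority classifier are hypothetical and are not taken from the water quality data.

```python
# Hypothetical test set: 990 majority samples (label 0) and 10 minority samples (label 1).
y_true = [0] * 990 + [1] * 10

# A degenerate classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)   # share of all samples classified correctly
sensitivity = tp / (tp + fn)         # share of minority samples classified correctly

print(f"accuracy = {accuracy:.2%}, sensitivity = {sensitivity:.2%}")
```

Under these assumed counts the classifier reaches 99% accuracy while its sensitivity for the minority class is zero.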

Basically, there are three approaches to dealing with imbalanced data: data-level, algorithm-level
and ensemble methods [10]. The data-level approach consists of re-sampling the data to reduce class
imbalance. There are two basic re-sampling techniques: under-sampling the majority class and
oversampling the minority class. Among oversampling techniques, the most fundamental is random
oversampling (ROS). Rachburee and Punlumjeak [5] applied the adaptive synthetic (ADASYN) method, the synthetic
minority oversampling technique (SMOTE), SVMSMOTE and Borderline-SMOTE to predict student
performance. They found that the Borderline-SMOTE method gave the best prediction result across several
classifiers. Among under-sampling techniques, random under-sampling (RUS) is the most popular.
Some researchers have opted to combine oversampling and under-sampling, which is called
hybrid sampling [6], [11]. These techniques are used to produce a balanced dataset so that the classifiers
are not biased toward one class or another. Lin and Nguyen [6] used a hybrid sampling technique that involved
ROS followed by RUS with a balance loss cost function to resolve imbalanced data. They found that
oversampling followed by under-sampling was more effective than under-sampling followed by
oversampling. They also found that the proposed method improved performance by 7%. The advantages of the
ROS-RUS method are that it assumes nothing about the data, is simple and uses no heuristics [6]. Another study
[12] combined an oversampling technique, SMOTE, with an under-sampling technique to address the imbalance issue
on 10 datasets. They found that hybrid sampling performed better than the other techniques.

Second, in the algorithm-level approach, the machine learning model is modified to adapt to the imbalanced
data. The third approach is the ensemble method. An ensemble method combines the decisions of several base learners
to produce a more precise prediction than any single base learner's decision [13]. Two ensemble families
are commonly used in machine learning: bagging and boosting. Bootstrap aggregating, or bagging, is a
method that learns multiple base classifiers in parallel. The advantage of bagging is that it can
lower the variance while retaining the low bias of the base classifiers. This is done by averaging the outputs of the base
classifiers [11]. Boosting also works by combining multiple base learners. However, it trains the
learners sequentially [14]. Each learner assigns weights to the instances, and the weighted
instances are then used by the next learner. The weights of incorrectly classified instances
are increased, while the weights of correctly classified instances are decreased. Both bagging and
boosting provide higher stability to the classifiers and are good at reducing variance. A previous
study compared the performance of a single model and a modified ensemble bagging model using
banking financial ratio data. The results showed that the modified ensemble bagging model was always more
accurate than the single model [2]. This is supported by another study [15], which found that an ensemble
bagging model increased the performance of the C4.5 and CART decision tree models. Evangelista and Sy [16]
used four ensemble models, namely homogeneous ensembles (boosting and bagging) and heterogeneous
ensembles (stacking and voting), to enhance the performance of different single classifiers. Their results
revealed that the voting ensemble model performed slightly better than the boosting and bagging models. Meanwhile,
Priasni and Oswari [17] applied three ensemble learning models, namely voting, AdaBoost and bagging, to
the Naive Bayes, decision tree and support vector machine classifiers. They found that the AdaBoost model using
decision tree as the base classifier had the highest accuracy and precision, while the bagging model using support vector
machine as the base classifier had the highest f-measure, area under the curve (AUC) and recall.

However, ensemble methods that employ resampling techniques are expected to work better in
handling imbalanced data. This was shown in [11], where a hybrid sampling method combined with a
bagging model (RSYNBagging) had the best classification performance based on the AUC-ROC plot. That
study demonstrated the advantage of combining oversampling and under-sampling techniques with an ensemble
model to address the class imbalance issue. Lu et al. [18] also used hybrid sampling with bagging (HSBagging), which
integrated random under-sampling and SMOTE with the bagging algorithm. The study found
that HSBagging outperformed the related UnderBagging and SMOTEBagging methods. Many
researchers have used machine learning models such as random forest (RF) [19], [20], extreme gradient boosting
(XGBoost) [3], ensemble models [21], [22], a hybridization of random forest and extreme gradient boosting [23],
gradient boosting (GB) and conventional machine learning models such as decision tree (DT), artificial neural
network (ANN), k-nearest neighbors (KNN), support vector machine (SVM) and Naive Bayes (NB) [8] for
classification. However, there is still no conclusive evidence as to which is the best approach. The aim of this
study is therefore to investigate the predictive performance of five conventional machine learning models
(SVM, NB, KNN, ANN and DT) and two popular ensemble models (random forest and gradient
boosting) using three resampling techniques, namely random oversampling (ROS), random under-sampling
(RUS) and hybrid sampling (ROS-RUS), on the imbalanced water quality classification (WQC)
dataset. This paper is organized as follows: section 2 describes the methodology for evaluating the machine
learning models with the sampling techniques (ROS, RUS and ROS-RUS) applied. The results are presented
and discussed in section 3 and the conclusion is given in section 4.


2. METHOD 
2.1. Water quality data 

This study used secondary data on various water quality parameters obtained from the
Department of Environment (DOE) Malaysia. DOE monitors the water quality of the Kelantan River
4, 5 or 6 times per year, depending on the station. The Kelantan River is one of the main rivers in Malaysia and is
located in the north-east of peninsular Malaysia. The data cover 2005 to 2020. From 2005 until 2015, the data came
from 8 stations situated along the Kelantan River, namely Jambatan Kusia, Jambatan Sultan Yahya Petra, Kota Bahru,
Tangga Kerai, Bandar Kuala Kerai, Jambatan bandar Rantau Panjang-Golok, Kampung Kuala Sat, Jeli, Kampung
Bukit Bunga, Kampung Lubok Setol and Kampung Jeram Perdah. Later, in 2016, data from a new station at Loji
Air Lemal, Pasir Mas were included. In 2018, three new monitoring stations were added along the Kelantan River:
Sg. Relai, Loji Ayer Lanas and Skim Bekalan Air Merbau Chondong. Hence, the dataset in
this study comprises 685 observations measured 4, 5 or 6 times per year for 16 years at 12 locations. The dataset consists
of the target variable, which is the water quality classification (WQC), and 13 physicochemical parameters:
dissolved oxygen (DO), biochemical oxygen demand (BOD), chemical oxygen demand (COD), total
suspended solids (TSS), pH, ammoniacal nitrogen (NH3-N), temperature, conductivity, salinity, turbidity,
nitrogen (NO3), phosphorus (PO4) and Escherichia coli (E. coli). WQC is constructed based on the water quality
index (WQI) value range as shown in Table 1. The water quality is classified as clean if the WQI value ranges between
81 and 100 and slightly polluted if it ranges between 60 and 80 [8].


Table 1. Water quality classification

Parameter              Water quality classification
                       Slightly Polluted    Clean
Water quality index    60-80                81-100


2.2. Data pre-processing 

Data pre-processing is a vital step in preparing the data before developing water quality predictive
models using machine learning classifiers. It involves a number of important steps, such as data clean-up, data
transformation and feature selection. Data clean-up and transformation are used to remove outliers
and to standardize the data to a common scale. This study used the z-score method to standardize the data and
the Mahalanobis distance to detect outliers. Based on the Mahalanobis distance, 27 outliers were detected
and removed from the dataset, leaving 658 samples. Next, the missing values
analysis showed that only 3 variables, namely turbidity, phosphorus and E. coli, have missing values, with missing
percentages of 1.0%, 1.8% and 1.0% respectively. The missing values were imputed using the expectation
maximization (EM) method. This study used R programming software to analyse the data.
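
The standardization and outlier-detection steps can be sketched as follows. This is a minimal Python illustration rather than the R code used in the study: the two-feature data points are hypothetical, the function names are made up for this example, and the cut-off used to flag outliers is not stated in the text, so the sketch only reports the point with the largest squared Mahalanobis distance.

```python
import statistics

def z_scores(values):
    """Standardize values to zero mean and unit sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    n = len(points)
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    # d^2 = (p - mean)^T S^{-1} (p - mean), using the closed-form 2x2 inverse.
    return [
        (syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my) + sxx * (y - my) ** 2) / det
        for x, y in points
    ]

# Hypothetical two-feature measurements; the last point is an intended outlier.
points = [(1.0, 2.0), (2.0, 3.1), (3.0, 3.9), (4.0, 5.2),
          (5.0, 6.0), (6.0, 6.8), (7.0, 8.1), (4.0, 12.0)]
d2 = mahalanobis_sq_2d(points)
most_extreme = max(range(len(points)), key=lambda i: d2[i])
print("index of most extreme point:", most_extreme)
```

In practice a threshold on the squared distance (for example, a chi-square quantile) would decide which points are removed; that choice is an assumption left open here.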


2.3. Conventional machine learning models 
2.3.1. K-nearest neighbours 

The K-nearest neighbours (KNN) algorithm classifies a sample by finding the nearest neighbours of the
given point and assigning it the majority class among its K neighbours. In the event of a tie, different
techniques can be used to resolve it. However, KNN is not recommended for large datasets, since all processing
occurs at prediction time: the algorithm iterates through all the training data and computes the nearest neighbours each
time [24]. This study used K = 10 for the KNN model.
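
The majority-vote rule can be sketched in a few lines of Python. The training points, labels and the choice of Euclidean distance are assumptions made for illustration; the toy set is too small for K = 10, so the example uses k = 3.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    # On a tie, most_common returns the label encountered first, i.e. the nearer neighbour's.
    return votes.most_common(1)[0][0]

# Hypothetical standardized measurements with two features per sample.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["clean", "clean", "clean",
           "slightly polluted", "slightly polluted", "slightly polluted"]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))
print(knn_predict(train_X, train_y, (5.1, 5.0), k=3))
```

Every prediction scans the whole training set, which is the cost the paragraph above refers to.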


2.3.2. Support vector machines 

Support vector machine (SVM) is a classification method based on statistical learning
theory. SVM uses the structural risk minimization principle to address the overfitting problem in machine
learning by reducing the model’s complexity while still fitting the training data successfully. Minimizing this risk
can enhance the generalization of the SVM model [25]. The estimates of the SVM model are built from a
small subset of the training data known as the support vectors. The interpretability of support vector
machine decisions can be improved by identifying the vectors that are chosen as support vectors [26]. SVM maps
the initial data into a high-dimensional feature space in which an optimal separating hyperplane is constructed using a
suitable kernel function. For classification, the optimal separating hyperplane divides the feature space into
two parts, with each class placed on a different side. On each side of the separating hyperplane, a parallel
hyperplane can be built to separate the training data. The hyperplane is optimal if the margin between the closest
training vectors and the hyperplane is maximal. This study used a complexity constant of C = 5 to set the
misclassification tolerance. A large value of C can lead to overfitting, while a small value may cause
over-generalization. This study used the polynomial kernel since it is suitable for cases where all training data are
normalized.


2.3.3. Artificial neural network 

An artificial neural network (ANN) works like the human brain's nervous system, which comprises
interconnected neurons that work together in parallel [8]. It is widely used in many fields because of
advantages such as its self-organizing, self-learning and self-adapting abilities. A neural network’s structure is
composed of 3 layers: the input, middle (hidden) and output layers. Input variables enter the algorithm
at the input layer. In the middle layer, the input variables are multiplied by weights, summed and added to
a constant value (the bias). Then, an activation function is applied to the weighted sum. Activation functions
are needed to transform the input signals into output signals. Recent artificial neural network algorithms employ
non-linear activation functions [27]. This is because non-linear activation functions allow
backpropagation and the stacking of multiple layers of neurons to produce the complex mappings between inputs and
outputs that are needed to learn complex datasets. The most popular activation functions are Gaussian, Sigmoid
and Tansig. In the output layer, the prediction is obtained from the parallel computation in the middle layer.
The mathematical formula of the neuron computation is given by $y_j = f\left(\sum_i w_{ij} a_i + \theta_j\right)$, where $w_{ij}$ are the weights,
$a_i$ are the input variables and $\theta_j$ are the biases. This study used the default configuration, which consists of one
hidden layer with a Sigmoid activation function and size equal to (number of attributes + number of classes)/2 + 1.
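
The computation of a single neuron, a weighted sum of the inputs plus a bias passed through a sigmoid activation, can be sketched as below. The input values, weights and bias are hypothetical numbers chosen only to show the arithmetic.

```python
import math

def sigmoid(z):
    """Logistic activation function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus the bias, passed through the sigmoid."""
    z = sum(w * a for w, a in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical inputs, weights and bias: z = 0.2 - 0.3 - 0.4 + 0.1 = -0.4
y = neuron_output(inputs=[0.5, -1.0, 2.0], weights=[0.4, 0.3, -0.2], bias=0.1)
print(round(y, 4))
```

A hidden layer repeats this computation once per neuron, each with its own weights and bias.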


2.3.4. Decision tree 

Decision tree (DT) is a simple and explicit algorithm that makes decisions based on the values of all
relevant input parameters. DT uses entropy to select the root variable and, based on that, examines the values
of the other parameters. It organizes all the parameter decisions in a tree from top to bottom and makes the
decision based on the different values of the different parameters [28]. Decision tree models have frequently been found in
previous studies to perform well on imbalanced data. However, decision tree-based ensemble models,
including random forest (RF) and gradient boosting (GB), almost always outperform a single decision tree.
The advantages of decision tree-based models are that they are not sensitive to missing values, can handle both regular
attributes and data, and are highly efficient.
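
How entropy can guide the choice of a split variable is sketched below using information gain, the usual entropy-based criterion. The exact criterion used by the software in this study is not stated, so this is an assumption, and the labels and candidate splits are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction obtained by splitting `labels` into `groups`."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

labels = ["clean"] * 4 + ["polluted"] * 4

# A split that separates the classes perfectly versus one that separates nothing.
perfect_split = [["clean"] * 4, ["polluted"] * 4]
useless_split = [["clean", "clean", "polluted", "polluted"],
                 ["clean", "clean", "polluted", "polluted"]]

print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

The variable whose split yields the largest gain would be placed at the root, and the procedure is repeated on each branch.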


2.3.5. Naive Bayes 

The Bayes approach employs knowledge of probability and statistics to classify the data and estimate the
outcome. The Bayes model uses prior and posterior probabilities to prevent overfitting and the
bias that arises from using only sample information [29]. A classification technique that uses Bayes' theorem together with the
conditional independence assumption is known as Naive Bayes (NB). When the target value is specified, the
attributes are assumed to be conditionally independent of each other [29]. This assumption greatly reduces the complexity
of the Bayes model. The probability that event A occurs given that event B has occurred is different
from the probability that event B occurs given that event A has occurred. Assume that $A_1, A_2, \ldots, A_n$ are the event
vectors and B is the dataset class; the Naive Bayes formula may then be written as shown in (1):


$$P(B \mid A_1, A_2, \ldots, A_n) = \frac{P(B)\prod_{i=1}^{n} P(A_i \mid B)}{P(A)} \qquad (1)$$


where $P(A) = P(A_1, A_2, \ldots, A_n)$ is the prior probability of the event vectors, $P(B)$ is the prior probability of the
dataset class and $P(A_i \mid B)$ is the probability of event $A_i$ given class B. This study used default values for this algorithm.
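
As a worked illustration of the Naive Bayes rule, the sketch below multiplies a class prior by per-feature conditional probabilities and normalizes over the classes. All probabilities and the two categorical features are hypothetical; the study's attributes are continuous, for which an implementation would typically use a density estimate instead of a lookup table.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior P(class | observed features) under conditional independence.

    priors:      {class: P(B)}
    likelihoods: {class: [ {value: P(A_i = value | B)} for each feature i ]}
    observed:    tuple with one observed value per feature
    """
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for feature_probs, value in zip(likelihoods[cls], observed):
            score *= feature_probs[value]
        scores[cls] = score
    evidence = sum(scores.values())  # normalizing constant P(A)
    return {cls: s / evidence for cls, s in scores.items()}

# Hypothetical probabilities for two categorical features (DO level, BOD level).
priors = {"clean": 0.8, "slightly polluted": 0.2}
likelihoods = {
    "clean": [{"high": 0.9, "low": 0.1}, {"low": 0.8, "high": 0.2}],
    "slightly polluted": [{"high": 0.3, "low": 0.7}, {"low": 0.4, "high": 0.6}],
}

posterior = naive_bayes_posterior(priors, likelihoods, observed=("low", "high"))
print(posterior)
```

With these assumed numbers the unnormalized scores are 0.8 × 0.1 × 0.2 = 0.016 for clean and 0.2 × 0.7 × 0.6 = 0.084 for slightly polluted, so the posterior favours the minority class despite its smaller prior.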


2.4. Ensemble methods for imbalanced problem 
2.4.1. Bagging ensemble method for machine learning 

Random forest is a classification model that trains multiple base models, typically decision trees, independently on
subsets of the data and makes decisions based on all the models [30]. It uses feature randomness
and bagging when building each individual decision tree to produce a forest of independently built trees. RF
averages several deep decision trees trained on different parts of the same training set, with
the aim of reducing the variance. The prediction of this committee is more accurate than that of any individual
tree and is robust against overfitting. In a random forest, each node is split using the best among a subset of
predictors randomly chosen at that node [31]. The RF algorithm works by first drawing ntree bootstrap sub-samples from
the original dataset with replacement. Then, a decision tree model is trained on each bootstrap sample. New
data are predicted by aggregating the predictions of the ntree models (majority vote for classification).
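
The bootstrap-and-vote part of this procedure can be sketched as follows. This is a simplified illustration under stated assumptions: the base learner is a one-feature threshold rule instead of a full decision tree, the per-node random choice of predictors that a random forest adds is omitted, and the data and function names are hypothetical.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Fit a one-feature threshold rule with the fewest training errors."""
    best = None
    for t in sorted(set(xs)):
        for above in (0, 1):
            preds = [above if x > t else 1 - above for x in xs]
            errors = sum(p != y for p, y in zip(preds, ys))
            if best is None or errors < best[0]:
                best = (errors, t, above)
    _, t, above = best
    return lambda x: above if x > t else 1 - above

def bagging_fit(xs, ys, n_trees, rng):
    """Train one base model per bootstrap sample."""
    n = len(xs)
    models = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # n indices drawn with replacement
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models

def bagging_predict(models, x):
    """Aggregate the base models' predictions by majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
models = bagging_fit(xs, ys, n_trees=25, rng=random.Random(0))
print(bagging_predict(models, 1.5), bagging_predict(models, 7.5))
```

Individual base models may disagree because each sees a different bootstrap sample; the vote averages out those differences.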

2.4.2. Boosting ensemble method for machine learning

Gradient boosting is a boosting-based machine learning algorithm that trains multiple weak
classifiers, typically decision trees, to create a robust model for regression and classification problems [32]. It
builds the model in a stage-wise way, as other boosting techniques do, and generalizes
them by optimizing a suitable cost function. In the GB algorithm, the cases that were predicted poorly at one step
receive more emphasis at the next step. The advantages of GB are its high predictive accuracy and
fast processing.
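
The stage-wise idea can be sketched for the simplest case, squared-error loss, where the negative gradient of the loss is the residual and each new weak learner is fitted to the residuals of the current model. This is an illustrative assumption: the classification setting of this study would use a different loss function, and the data, learning rate and function names below are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Single-threshold split that minimizes the squared error on the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest value so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_stages, learning_rate):
    """Additive model built stage by stage on the residuals of the previous stages."""
    f0 = sum(ys) / len(ys)  # initial constant model
    stumps = []
    preds = [f0] * len(xs)
    for _ in range(n_stages):
        # For squared-error loss the negative gradient is the residual y - F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_regression_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(s(x) for s in stumps)

xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
model = gradient_boost(xs, ys, n_stages=20, learning_rate=0.5)
print(round(model(2), 3), round(model(7), 3))
```

Each stage shrinks the remaining error, so cases that are still poorly predicted dominate what the next learner fits.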


2.5. Sampling techniques 

This section outlines three sampling techniques utilized in this study to address the issue of 
imbalanced data. Random under-sampling, random oversampling and hybrid sampling ROS-RUS are among 
the approaches used. The details about each sampling technique are discussed briefly below. 


2.5.1. Random under-sampling 

The random under-sampling (RUS) method works by randomly removing instances of the majority
class until a desired majority-to-minority ratio is achieved. However, the drawback of this method is that it
may delete useful data, which causes information loss [18]. This random deletion may also alter the distribution of the
majority class and therefore its representative features. When this occurs, a large number of
majority cases will be misclassified. Despite these drawbacks, RUS generally works better than other
under-sampling methods [11].
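
A minimal sketch of random under-sampling is given below. The function name, the toy data and the target ratio of 1.0 are assumptions for illustration only.

```python
import random

def random_under_sample(X, y, majority_label, ratio, rng):
    """Randomly keep majority samples so that n_majority <= ratio * n_minority."""
    majority = [i for i, label in enumerate(y) if label == majority_label]
    minority = [i for i, label in enumerate(y) if label != majority_label]
    n_keep = min(len(majority), int(ratio * len(minority)))
    kept = rng.sample(majority, n_keep) + minority  # sample without replacement
    kept.sort()  # preserve the original sample order
    return [X[i] for i in kept], [y[i] for i in kept]

# Hypothetical data: 80 majority samples (label 0) and 20 minority samples (label 1).
X = list(range(100))
y = [0] * 80 + [1] * 20

X_rus, y_rus = random_under_sample(X, y, majority_label=0, ratio=1.0, rng=random.Random(0))
print(y_rus.count(0), y_rus.count(1))
```

The 60 discarded majority samples are the information loss the paragraph above refers to.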


2.5.2. Random oversampling 

Among oversampling techniques, the most fundamental is random oversampling. In
random oversampling (ROS), minority class samples are randomly selected and duplicated until the data become
balanced [11]. Nevertheless, this approach can lead to overfitting, where the classifiers become biased
toward the duplicated samples. Consequently, the classifiers may be unable to classify new instances correctly.
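
Random oversampling can be sketched in the same style. The function name and toy data are hypothetical; the sketch duplicates minority samples until both classes have the same size.

```python
import random

def random_over_sample(X, y, minority_label, rng):
    """Duplicate randomly chosen minority samples until the classes are balanced."""
    minority = [i for i, label in enumerate(y) if label == minority_label]
    n_majority = len(y) - len(minority)
    # Indices of the extra copies, drawn with replacement from the minority class.
    extra = [rng.choice(minority) for _ in range(n_majority - len(minority))]
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]

# Hypothetical data: 80 majority samples (label 0) and 20 minority samples (label 1).
X = list(range(100))
y = [0] * 80 + [1] * 20

X_ros, y_ros = random_over_sample(X, y, minority_label=1, rng=random.Random(0))
print(y_ros.count(0), y_ros.count(1))
```

No new information is created: the added rows are exact copies, which is why a classifier can become biased toward them.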


2.5.3. Hybrid sampling 

Interesting results can be obtained by combining random oversampling with random under-sampling,
and the classifier’s performance could be enhanced to a greater extent. In the hybrid sampling (ROS-RUS)
method, the minority class data are mixed with the majority class data after oversampling and then all the data are
down-sampled so that they match the input of the network. The imbalance ratio of the generated dataset
is also random, resulting in additional diversity from which the ensemble can also benefit [33]. Consider
a dataset TR with N samples $\{x_i, y_i\}$, $i = 1, 2, \ldots, N$, where $x_i$ is a sample in the m-dimensional feature space
and the class label $y_i \in C = \{Y_0, Y_1\}$. The $x_i$ are observations of a random vector of attributes $x$ defined on $\mathbb{R}^{m}$ with unknown
probability density function $f(x)$. Let $N_j$ be the number of samples belonging to class $Y_j$. First, the random
oversampling procedure chooses $y^* = Y_j$ with probability $\pi_j$. Then, it selects $\{x_i, y_i\} \in TR$ such that $y_i = y^*$, with
probability $1/N_j$. Lastly, it samples $x^*$ from $K_{H_j}(\cdot, x_i)$, where $K_{H_j}$ is a probability distribution centred at $x_i$
with covariance matrix $H_j$ [34].
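
The three sampling steps described above can be sketched as follows. The kernel is assumed here to be an isotropic Gaussian with a single bandwidth parameter, which is one possible choice and not necessarily the one used in [34]; the data, class probabilities and function name are hypothetical.

```python
import random

def sample_synthetic(X, y, class_probs, bandwidth, rng):
    """Draw one synthetic sample (x*, y*) following the three steps in the text."""
    # Step 1: choose the class y* with the given class probabilities.
    classes = list(class_probs)
    y_star = rng.choices(classes, weights=[class_probs[c] for c in classes], k=1)[0]
    # Step 2: choose one sample of that class uniformly, i.e. with probability 1/N_j.
    members = [x for x, label in zip(X, y) if label == y_star]
    x_i = rng.choice(members)
    # Step 3: sample x* from a kernel centred at x_i. Assumption: an isotropic
    # Gaussian with standard deviation `bandwidth` for every feature.
    x_star = tuple(rng.gauss(v, bandwidth) for v in x_i)
    return x_star, y_star

# Hypothetical data: two "clean" samples and one "slightly polluted" sample.
X = [(0.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
y = ["clean", "clean", "slightly polluted"]
probs = {"clean": 0.5, "slightly polluted": 0.5}

rng = random.Random(42)
draws = [sample_synthetic(X, y, probs, 0.1, rng) for _ in range(1000)]
n_minority = sum(1 for _, label in draws if label == "slightly polluted")
print(n_minority)
```

Choosing equal class probabilities yields a roughly balanced synthetic dataset even though the original data are imbalanced.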


2.6. Performance evaluation 

The machine learning models were evaluated using the 10-fold cross validation technique. Cross
validation was used to assess the predictive models by dividing the original data into training and testing datasets
ten times. Typically, the data were divided in a ratio of 70:30. Although no universal guideline exists,
the ratio of 70:30 is the most frequently used for the evaluation of predictive models [35]. In this study, seven distinct
metrics were considered: balanced accuracy = (Sensitivity + Specificity)/2, accuracy = (TP + TN)/
(TP + FP + FN + TN), specificity = TN/(TN + FP), sensitivity = TP/(TP + FN), precision = TP/
(TP + FP), f-measure = (2 × Precision × Recall)/(Precision + Recall) and area under the curve
(AUC) = (1 + TPR − FPR)/2, where recall is the same as sensitivity, TPR is the true positive rate and FPR is the false
positive rate. These metrics were computed from the values in the confusion matrix shown in Table 2.


Table 2. Confusion matrix for binary classification

                                Predicted
                                Clean                  Slightly Polluted
Actual    Clean                 True Negative (TN)     False Positive (FP)
          Slightly Polluted     False Negative (FN)    True Positive (TP)
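
The metric definitions can be checked with a short Python sketch. The confusion-matrix counts below are not reported in the paper; they are illustrative assumptions, chosen so that the resulting accuracy, balanced accuracy, sensitivity, specificity, precision and f-measure agree with the GB row of Table 3 after rounding.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the metrics defined in the text from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # recall, true positive rate (TPR)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    fpr = fp / (fp + tn)           # false positive rate
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_measure": 2 * precision * sensitivity / (precision + sensitivity),
        # Single-threshold formula from the text; since FPR = 1 - specificity,
        # it is algebraically equal to the balanced accuracy.
        "auc_single_point": (1 + sensitivity - fpr) / 2,
    }

# Assumed counts, with "slightly polluted" as the positive class.
m = classification_metrics(tp=32, fp=2, fn=8, tn=154)
for name, value in m.items():
    print(f"{name}: {100 * value:.2f}%")
```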

3. RESULTS AND DISCUSSION
3.1. Imbalance ratio (IR) 

Imbalance ratio is the most common measure used to describe the extent of the imbalance of a dataset.
It is defined as the number of majority class samples divided by the number of minority class samples [36]. The imbalance ratio in this
study is 3.84, which means the data are moderately imbalanced. The imbalance between the clean and
slightly polluted classes is shown in Figure 1.


[Figure 1: bar graph titled "Bar Graph for Imbalanced WQC"; x-axis: WQC (Clean, Slightly Polluted); y-axis: Percent]


Figure 1. Bar graph for water quality classification


3.2. Comparison of ensemble models and conventional machine learning 

This subsection presents the performance results of the five conventional machine learning models and the
two ensemble models without resampling the original data. Based on the output in Table 3, the
two ensemble models, RF and GB, perform better than the conventional machine learning models
in terms of accuracy, f-measure and AUC. The GB model, which uses the ensemble boosting
method, is clearly superior to the other machine learning models. It is followed by RF, which uses the ensemble bagging method.
This means that the boosting and bagging models have enhanced the performance of the classifiers.


Table 3. Performance metrics of conventional machine learning and ensemble models

Algorithm  Accuracy  Balanced      Sensitivity  Specificity  Precision  F-measure  AUC
           (%)       Accuracy (%)  (%)          (%)          (%)        (%)        (%)
KNN        90.82     82.15         67.50        96.79        84.38      75.00      89.86
SVM        92.86     85.29         72.50        98.08        90.62      80.56      92.85
ANN        93.37     89.33         82.50        96.15        84.62      83.54      93.85
DT         86.22     80.19         70.00        90.38        65.12      67.47      87.76
NB         90.31     85.54         77.50        93.59        75.61      76.54      92.52
RF         93.88     88.72         80.00        97.44        88.89      84.21      98.27
GB         94.90     89.36         80.00        98.72        94.12      86.49      98.61


3.3. Comparison of performance metrics for all machine learning after resampling 

Next, this study compares the performance of the seven machine learning models using ROS, RUS and hybrid
sampling (ROS-RUS). The method without resampling, meaning that no established method of handling
imbalance was applied, was also included as a baseline reference. Based on the output in Table 4, the best
method was GB with the original data for almost all metrics except balanced accuracy and sensitivity, followed
by RF with ROS-RUS in terms of accuracy, balanced accuracy, specificity, precision and f-measure. The
methods that showed the worst results were NB with ROS-RUS followed by DT with RUS. Focusing on each
performance metric separately, the specificity and precision results for some classifiers, namely KNN,
SVM and GB, suggest that it is better not to resample the data, since resampling did not improve
these classifiers. However, for classifiers such as RF, ANN, DT and NB, the results improved after resampling.
The results for sensitivity showed improvement for all classifiers except NB when using the resampled datasets, as
shown in Figure 2.


Table 4. Performance metrics for all machine learning after resampling using ROS, RUS and ROS-RUS

Algorithm  Sampling            Accuracy  Balanced      Sensitivity  Specificity  Precision  F-measure  AUC
                               (%)       Accuracy (%)  (%)          (%)          (%)        (%)        (%)
KNN        Without resampling  90.82     82.15         67.50        96.79        84.38      75.00      89.86
           ROS                 89.29     81.19         67.50        94.87        77.14      72.00      81.19
           RUS                 94.39     92.76         90.00        95.51        83.72      86.75      98.02
           ROS-RUS             90.82     86.79         80.00        93.59        76.19      78.05      86.79
SVM        Without resampling  92.86     85.29         72.50        98.08        90.62      80.56      92.85
           ROS                 92.86     88.08         80.00        96.15        84.21      82.05      93.22
           RUS                 86.22     83.91         80.00        87.82        62.75      70.33      90.88
           ROS-RUS             88.78     87.37         85.00        89.74        68.00      75.56      93.43
ANN        Without resampling  93.37     89.33         82.50        96.15        84.62      83.54      93.85
           ROS                 95.92     93.72         90.00        97.44        90.00      90.00      97.84
           RUS                 93.88     91.51         87.50        95.51        83.33      85.37      95.18
           ROS-RUS             90.31     87.40         82.50        92.31        73.33      77.65      89.04
DT         Without resampling  86.22     80.19         70.00        90.38        65.12      67.47      87.76
           ROS                 87.76     83.94         77.50        90.38        67.39      72.09      92.14
           RUS                 84.18     79.84         72.50        87.18        59.18      65.17      77.99
           ROS-RUS             86.22     81.12         72.50        89.74        64.44      68.24      78.82
NB         Without resampling  90.31     85.54         77.50        93.59        75.61      76.54      92.52
           ROS                 86.73     82.37         75.00        89.74        65.22      69.77      90.11
           RUS                 91.33     87.12         80.00        94.23        78.05      79.01      95.21
           ROS-RUS             83.67     79.52         72.50        86.54        58.00      64.44      88.80
RF         Without resampling  93.88     88.72         80.00        97.44        88.89      84.21      98.27
           ROS                 91.84     87.44         80.00        94.87        80.00      80.00      97.12
           RUS                 89.80     88.94         87.50        90.38        70.00      77.78      97.19
           ROS-RUS             94.39     89.97         82.50        97.44        89.19      85.71      97.58
GB         Without resampling  94.90     89.36         80.00        98.72        94.12      86.49      98.61
           ROS                 93.37     89.33         82.50        96.15        84.62      83.54      98.38
           RUS                 91.84     91.15         90.00        92.31        75.00      81.82      97.04
           ROS-RUS             92.86     90.87         87.50        94.23        79.55      83.33      97.28

Figure 2. Comparison of classifier performance by resampling method


Moreover, it is worth noting that some conventional classifiers performed
better when resampling methods were used. Sensitivity improves for KNN (ROS-RUS and RUS), SVM
(highest with ROS-RUS), ANN (highest with ROS), DT (highest with ROS) and NB (RUS), as shown in Figure 3. In
addition, the f-measure metric revealed that some resampling methods helped to overcome the imbalance compared
with no resampling. The improvement was also observed for the ensemble classifier RF. Sensitivity improves
for both RF and GB, especially under the RUS sampling method, as shown in Figure 4.


Indonesian J Elec Eng & Comp Sci, Vol. 29, No. 1, January 2023: 598-608 




[Figure 3: five bar-chart panels (Performance of KNN Classifier, Performance of SVM Classifier, Performance of NB Classifier, Performance of ANN Classifier, Performance of DT Classifier); x-axis: Performance Metric; y-axis scale: 0 to 100; legend: ORI_DATA, ROS, RUS, ROS-RUS]


Figure 3. Comparison of conventional classifier performance by resampling method 

Figure 4. Comparison of ensemble classifier performance by resampling method


4. CONCLUSION 


This paper illustrated the impact of using data-sampling approaches when developing predictive models
for imbalanced water quality data. These approaches primarily involve the use of preprocessing techniques
such as RUS, ROS and ROS-RUS (hybrid sampling) to transform an imbalanced dataset into a balanced
dataset. The analysis was conducted to emphasize the effect of resampling techniques on the performance of
two ensemble families: bagging (random forest) and boosting (gradient boosting). The ensemble boosting
method, while it requires more computing power, clearly outperformed the bagging method. Surprisingly,
training the ensembles on the original dataset without any change gave quite good results overall,
especially for gradient boosting. Among the resampling techniques, ROS generally performed better, but with a minimal
advantage, closely followed by RUS. A very interesting conclusion of the study is the importance of using
different assessment metrics when addressing imbalance issues. This is preferable because every metric uses
the values of the confusion matrix in a specific way and thus has its own strengths and weaknesses. Therefore,
the use of more than one measure provides a more informed view of the results and an improved assessment
of a single classifier’s performance.